The present study was an attempt to alleviate some of the difficulties
inherent  in  multiple-choice  items  by  having  examinees  respond  to  multiple- 
choice  items  in  a  probabilistic  manner.  Using  this  format,  examinees  are  able 
to  respond  to  each  alternative  and  to  provide  indications  of  any  partial 
knowledge  they  may  possess  concerning  the  item.  The  items  used  in  this  study 
were  30  multiple-choice  analogy  items.  Examinees  were  asked  to  distribute  100 
points among the four alternatives for each item according to how confident
they were that each alternative was the correct answer. Each item was scored
using  five  different  scoring  formulas.  Three  of  these  scoring  formulas — the 
spherical,  quadratic,  and  truncated  log  scoring  methods — were  reproducing 
scoring  systems.  The  fourth  scoring  method  used  the  probability  assigned  to 
the  correct  alternative  as  the  item  score,  and  the  fifth  used  a  function  of 
the  absolute  difference  between  the  correct  response  vector  for  the  four 
alternatives  and  the  actual  points  assigned  to  each  alternative  as  the  item 
score.  Total  test  scores  for  all  of  the  scoring  methods  were  obtained  by 
summing  individual  item  scores. 

Several  studies  using  probabilistic  response  methods  have  shown  the  effect  of 
a  response-style  variable,  called  certainty  or  risk  taking,  on  scores  obtained 
from  probabilistic  responses.  Results  from  this  study  showed  a  small  effect 
of  certainty  on  the  probabilistic  scores  in  terms  of  the  validity  of  the 
scores  but  no  effect  at  all  on  the  factor  structure  or  internal  consistency  of 
the  scores.  Once  the  effect  of  certainty  on  the  probabilistic  scores  had  been 
ruled  out,  the  five  scoring  formulas  were  compared  in  terms  of  validity, 
reliability,  and  factor  structure.  There  were  no  differences  in  the  validity 
of  the  scores  from  the  different  methods,  but  scores  obtained  from  the  two 
scoring  formulas  that  were  not  reproducing  scoring  systems  were  more  reliable 
and had stronger first factors than the scores obtained using the reproducing
scoring  systems.  For  practical  use,  however,  the  reproducing  scoring  systems 
may  have  an  advantage  because  they  maximize  examinees'  scores  when  examinees 
respond  honestly,  while  honest  responses  will  not  necessarily  maximize  an 
examinee's  score  with  the  other  two  methods.  If  a  reproducing  scoring  system 
is  used  for  this  reason,  the  spherical  scoring  formula  is  recommended,  since 
it  was  the  most  internally  consistent  and  showed  the  strongest  first  factor  of 
the  reproducing  scoring  systems. 
EFFECT OF EXAMINEE CERTAINTY ON PROBABILISTIC TEST SCORES

and  a  Comparison  of  Scoring  Methods  for  Probabilistic  Responses 


Psychometricians  have  searched  for  many  years  for  a  test  item  format  that 
would  allow  them  to  measure  individual  differences  on  a  variable  of  interest  as 
accurately  and  as  completely  as  possible.  The  multiple-choice  item  has  proven 
to be a useful tool for assessing knowledge, but there are several problems with
this item format. These problems include the possibility of an examinee guessing
the correct answer, the lack of information concerning the process used by
an examinee to obtain a given answer, and, in general, an inability to accurately
determine an examinee's level on a continuous underlying trait based on an
observable dichotomous response.

In attempts to remedy these problems and to extract the maximum amount of
information from an individual's responses to a set of test items, Lord and
Novick (1968, Chap. 14) have identified three important components of interest.
These  components  are 

1.  The  measurement  procedure,  or  the  manner  in  which  examinees  are  in¬ 
structed  to  respond  to  the  items. 

2.  The  item  scoring  formula. 

3.  The  method  of  weighting  each  item  to  form  a  total  score. 

In  their  attempts  to  find  alternatives  to  the  conventional  multiple-choice  item 
where  the  examinee  is  instructed  to  choose  the  one  best  answer  to  an  item  from  a 
number  of  alternatives,  investigators  have  generally  focused  on  one  or  two  of 
these  components  at  a  time. 

The  various  attempts  to  improve  upon  the  traditional  multiple-choice  item 
can  be  classified  into  three  broad  categories:  (1)  attempts  to  improve  the  mul¬ 
tiple-choice  item  by  using  an  item-weighting  formula  other  than  the  conventional 
unit-weighting  scheme,  (2)  variations  of  the  multiple-choice  item  that  attempt 
to  provide  more  information  about  an  examinee's  ability  level  by  asking  the  ex¬ 
aminee  to  respond  to  a  traditional  multiple-choice  item  in  a  manner  other  than 
simply  choosing  the  one  best  alternative,  and  (3)  the  use  of  item  types  which 
are  completely  different  from  the  conventional  multiple-choice  item,  such  as 
free-response items. The first category focuses on the third component enumerated
by Lord and Novick, the item-weighting formula. The second category focuses
on Lord and Novick's first two components, the measurement procedure and
item-scoring formulas, while continuing to use a unit-weighting scheme to combine
item scores into a total score. The third category focuses primarily on
the measurement procedure and, to a lesser extent, on item scoring formulas.

Item-Weighting  Formulas 

For  many  years  the  accepted  method  of  combining  item  scores  to  form  a  test 
score  was  simply  to  sum  all  of  the  individual  item  scores.  Since  this  procedure 
is  equivalent  to  multiplying  each  item  score  by  an  item  weight  of  1  and  then 
summing  the  weighted  item  scores,  the  method  has  been  called  unit  weighting.  In 
attempts  to  increase  the  validity  and/or  the  reliability  of  test  scores  obtained 
by  summing  item  scores,  many  researchers  have  abandoned  unit  weighting  in  favor 
of  various  forms  of  differential  weighting  of  individual  items.  These  methods 


of  differential  weighting  of  items  include  multiple  regression  techniques  (Wes- 
man  &  Bennett,  1959),  using  the  validity  coefficient  of  the  item  as  the  item 
weight  (Guilford,  1941),  weighting  items  by  the  reciprocal  of  the  item  standard 
deviation  (Terwilliger  &  Anderson,  1969),  a  priori  item  weights  (Burt,  1950), 
and  numerous  other  weighting  procedures  (Bentler,  1968;  Dunnette  &  Hogatt,  1957; 
Hendrickson,  1970;  Horst,  1936;  Wilks,  1938). 

In  reviewing  the  substantial  literature  in  this  area,  Wang  and  Stanley 
(1970, p. 664) have concluded that "although differential weighting theoretically
promises to provide substantial gains in predictive or construct validity, in
practice these gains are often so slight that they do not seem to justify the
labor involved in deriving the weights and scoring with them. This is especially
true when the component measures are test items ...." Gulliksen (1950) concluded,
in concurrence with Wang and Stanley (1970), that differential weighting
is  not  worthwhile  when  a  test  contains  more  than  approximately  10  items  and  when 
the  items  are  highly  correlated.  Stanley  and  Wang  (1970),  after  concluding  that 
differential  item  weighting  is  not  a  fruitful  venture  for  test  items,  have  sug¬ 
gested  that  the  item  score  be  determined  by  the  response  made  to  an  item,  where 
the  examinee  is  required  to  do  more  than  just  select  the  correct  alternative  for 
an  item.  By  changing  the  mode  of  response  and  devising  item  scoring  formulas 
appropriate  for  these  types  of  responses,  the  validity  and/or  reliability  of 
test  scores  might  be  increased.  An  additional  gain  might  be  more  insight  into 
the  process  involved  in  responding  to  test  items. 

Variations  of  the  Response  Format  of  Multiple-Choice  Items 

Several  of  the  earliest  attempts  at  modification  of  the  method  of  respond¬ 
ing  to  a  conventional  multiple-choice  item  were  reported  by  Dressel  and  Schmid 
(1953)  in  an  investigation  of  various  item  types  and  scoring  formulas.  A  con¬ 
ventional  multiple-choice  test  and  one  of  four  "experimental  test  forms"  were 
administered  to  each  subject.  The  items  in  each  of  the  experimental  test  forms 
resembled  conventional  multiple-choice  items  in  that  an  item  stem  and  several 
alternatives  were  provided,  but  each  experimental  test  form  differed  from  the 
conventional  multiple-choice  format  in  the  following  ways: 

1. Free-choice format. Examinees were instructed to choose as many of the
alternatives provided as necessary to insure that they had chosen the
correct alternative. This item format was scored using Equation 1,
which yields integer scores that range from -4 to 4 and applies only to
five-alternative items:

Item score = 4C - I [1]

where C = number of correctly marked alternatives and
I = number of incorrectly marked alternatives.

2.  Degree-of-certainty  test.  Examinees  were  instructed  to  choose  the  one 
best  answer  for  an  item  and  then  to  choose  one  of  four  confidence  rat¬ 
ings  provided  to  indicate  the  degree  of  confidence  they  had  in  the  an¬ 
swer  they  had  chosen.  This  item  format  was  scored  as  shown  in  Table  1. 

3.  Multiple-answer  format.  Each  item  contained  more  than  one  correct  al¬ 
ternative,  and  the  examinees  were  instructed  to  choose  all  of  the  cor¬ 
rect  alternatives.  The  score  for  this  format  was  the  number  of  correct 
alternatives  chosen  minus  a  correction  factor  for  any  incorrect  alter¬ 
natives  chosen. 


Table 1

Scoring System for Degree-of-Certainty Test

                                          Item Score
                                   Correct        Incorrect
                                   Answer         Answer
Confidence Rating                  Chosen         Chosen

Positive                              4              -4
Fairly certain                        3              -3
Rational guess                        2              -2
No defensible basis for choice        1              -1

4.  Two-answer  format.  Each  item  contained  exactly  two  correct  alterna¬ 
tives,  and  the  examinees  were  instructed  to  indicate  both  of  the  cor¬ 
rect  alternatives.  The  item  score  was  simply  the  number  of  correct 
alternatives  chosen. 
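
The scoring rules for the free-choice and degree-of-certainty formats (Formats 1 and 2 above) can be sketched in a few lines of Python. This is an illustrative sketch only, not code from Dressel and Schmid (1953): the function names are hypothetical, each item is assumed to have exactly one correct alternative, and the weight for the "Positive" rating is assumed to be 4, continuing the 3, 2, 1 progression of the other ratings.

```python
# Hypothetical helpers illustrating Equation 1 and the Table 1 weights.

CONFIDENCE_WEIGHTS = {
    "positive": 4,  # assumed value, continuing the 3, 2, 1 progression
    "fairly certain": 3,
    "rational guess": 2,
    "no defensible basis for choice": 1,
}


def free_choice_score(marked, correct):
    """Equation 1 (4C - I) for a five-alternative item with one correct answer."""
    marked = set(marked)
    n_correct = 1 if correct in marked else 0
    n_incorrect = len(marked - {correct})
    return 4 * n_correct - n_incorrect


def degree_of_certainty_score(chosen, correct, rating):
    """Plus the confidence weight for a correct choice, minus it otherwise."""
    weight = CONFIDENCE_WEIGHTS[rating]
    return weight if chosen == correct else -weight
```

Marking only the correct alternative gives the maximum of 4, marking only the four incorrect alternatives gives the minimum of -4, and marking all five alternatives gives 0, so indiscriminate marking earns nothing.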

In  comparing  these  five  test  forms  (the  conventional  multiple-choice  format 
and  the  four  experimental  test  formats),  Dressel  and  Schmid's  (1953)  results 
showed  that  the  experimental  test  formats  containing  more  than  one  correct  al¬ 
ternative  (Formats  3  and  4  above)  exhibited  greater  internal  consistency  reli¬ 
ability  than  the  other  three  test  forms,  but  these  test  formats  also  took  longer 
to  administer  than  all  of  the  other  formats.  All  of  the  experimental  test  for¬ 
mats  had  higher  internal-consistency  reliability  than  the  conventional  multiple- 
choice  test  except  for  the  free-choice  format,  but  the  conventional  multiple- 
choice format took less time than any of the experimental test formats.
Although the higher reliability coefficients of several of these formats (Formats
2, 3, and 4) might suggest that these formats aid in introducing more ability
variance  than  error  variance,  the  authors  warn  that  the  results  must  be  viewed 
with  caution,  since  there  were  statistically  significant  differences  between  the 
groups  taking  each  experimental  form  on  the  standard  multiple-choice  test  that 
was  administered  to  all  of  their  subjects;  thus,  the  differences  attributed  to 
the  effect  of  test  format  might  be  due  to  systematic  ability  differences  in  the 
groups  taking  each  of  the  experimental  test  formats. 

Hopkins,  Hakstian,  and  Hopkins  (1973)  used  a  confidence  weighting  procedure 
similar  to  the  degree-of-certainty  test  used  by  Dressel  and  Schmid  (1953)  and 
reported  higher  split-half  reliability  coefficients  for  the  confidence  weighting 
format  than  for  a  conventional  multiple-choice  test  using  the  same  items.  Hop¬ 
kins  et  al.  (1973)  also  reported  validity  coefficients  that  were  correlations 
between  the  test  scores  and  a  short-answer  form  of  the  same  test.  The  validity 
coefficient  for  the  conventional  test  (.70)  was  higher  but  not  significantly 
different  from  that  of  the  confidence  weighting  format  (.67). 

Coombs  (1953)  felt  that  examinees  could  provide  more  information  about  the 
degree  of  knowledge  they  possessed  by  eliminating  the  alternatives  which  they 
felt  were  incorrect,  rather  than  by  choosing  the  one  correct  alternative.  Items 
using  this  format  were  scored  by  assigning  one  point  for  each  incorrect  alterna¬ 
tive  eliminated  and  1  -  K  points  when  the  correct  alternative  was  eliminated, 
where  K  is  the  number  of  alternatives  provided.  This  scoring  system  yields  a 


range  of  integer  item  scores  from  -3  to  3  for  a  four-alternative  multiple-choice 
item. 
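
A minimal sketch of the elimination scoring rule described above may help make the score range concrete. The function name and argument layout are hypothetical; the rule itself (one point per incorrect alternative eliminated, 1 - K points if the correct alternative is eliminated) is taken from the text.

```python
def elimination_score(eliminated, correct, n_alternatives):
    """Elimination scoring: +1 per incorrect alternative eliminated,
    plus (1 - K) if the correct alternative is among those eliminated."""
    eliminated = set(eliminated)
    score = len(eliminated - {correct})
    if correct in eliminated:
        score += 1 - n_alternatives
    return score
```

For a four-alternative item, eliminating all three distractors yields 3, eliminating only the correct alternative yields -3, and eliminating nothing or everything yields 0.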

In  comparing  this  test  format  with  a  conventional  multiple-choice  test, 
Coombs,  Milholland  and  Womer  (1956)  found  no  differences  in  validity  between  the 
two  formats  for  separate  tests  of  vocabulary,  spatial  visualization,  and  driver 
information.  The  validity  coefficients  used  were  correlations  between  test 
scores  and  criteria  such  as  Stanford-Binet  IQ,  another  test  of  spatial  ability, 
and  subtest  scores  from  the  Differential  Aptitude  Test.  For  these  same  content 
areas,  the  experimental  test  format  yielded  higher  reliability  estimates  than 
the  conventional  test,  but  the  differences  between  the  estimates  were  not  sta¬ 
tistically  significant  for  any  of  the  content  areas.  One  result  in  favor  of  the 
experimental  test  format  was  that  the  subjects  in  the  experiment  felt  the  exper¬ 
imental  format  to  be  fairer  than  the  conventional  format. 

Another  variation  upon  the  conventional  multiple-choice  item  includes  a 
self-scoring  method  advocated  by  Gilman  and  Ferry  (1972),  which  requires  examin¬ 
ees  to  choose  among  alternatives  provided  until  the  correct  alternative  is  cho¬ 
sen.  Feedback  is  given  after  each  choice  is  made.  The  item  score  is  simply  the 
number  of  responses  needed  to  choose  the  correct  alternative;  thus,  a  higher 
score  indicates  less  knowledge  about  an  item.  Kane  and  Moloney  (1974)  have 
warned  that  although  Gilman  and  Ferry  (1972)  found  an  increase  in  split-half 
reliability  using  this  technique,  the  effect  of  using  this  method  on  the  reli¬ 
ability  of  the  test  depends  upon  the  ability  of  the  distractors  to  discriminate 
between  examinees  of  varying  levels  of  ability.  An  increase  in  reliability  will 
result  when  the  distractors  possess  this  ability  to  discriminate  among  ability 
levels,  but  no  increase  in  reliability  will  occur  if  this  is  not  the  case. 

Use  of  Subjective  Probabilities  with  Multiple-Choice  Items 

A  modification  of  the  traditional  multiple-choice  item  that  has  generated 
much  research  and  interest  is  the  use  of  examinees’  subjective  probabilities 
concerning  the  degree  of  correctness  of  each  alternative  provided  for  an  item  as 
a  method  of  assessing  the  degree  of  knowledge  or  ability  possessed  by  the  exam¬ 
inees.  By  assigning  a  probability  estimate  for  each  alternative  to  an  item, 
examinees  can  indicate  degrees  of  partial  knowledge  they  may  have  concerning 
each  alternative  for  an  item. 

To  simplify  this  procedure  for  examinees,  a  number  of  methods  have  been 
devised  to  aid  examinees  in  assigning  their  subjective  probabilities  to  the  al¬ 
ternatives.  One  method  is  to  ask  examinees  to  directly  assign  probabilities 
from  0  to  1.00  to  each  alternative,  with  the  restriction  that  the  probabilities 
assigned  to  all  of  the  alternatives  for  each  item  sum  to  1.00.  Another  method 
instructs  examinees  to  distribute  100  points  among  the  alternatives  for  each 
item.  The  distributed  points  are  then  converted  to  probabilities  for  scoring 
purposes  by  dividing  the  points  assigned  to  each  alternative  by  100.  Some  in¬ 
vestigators  have  used  fewer  points  for  distribution  (Rippey,  1970)  or  symbols, 
such  as  a  certain  number  of  stars,  which  are  to  be  distributed  among  the  alter¬ 
natives  (deFinetti,  1965),  but  the  concept  is  the  same. 
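
The conversion from distributed points to probabilities is simple enough to state as code. The sketch below is illustrative only; the function name and the `total` parameter (which allows for schemes using fewer than 100 points) are assumptions introduced here.

```python
def points_to_probabilities(points, total=100):
    """Convert points distributed among alternatives into probabilities.

    Raises ValueError if the points do not add up to the required total,
    since the resulting probabilities would then not sum to 1.00.
    """
    if sum(points) != total:
        raise ValueError("points must sum to %d" % total)
    return [p / total for p in points]
```

For example, a response of 70, 20, 10, and 0 points becomes the probability vector .70, .20, .10, .00.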

Using  these  types  of  measurement  procedures  (sometimes  called  probabilistic 
item  formats  or  probabilistic  response  formats),  an  item  scoring  formula  had  to 


be  devised  so  that  examinees'  expected  scores  would  be  maximized  only  when  they 
responded  according  to  their  actual  beliefs  concerning  the  correctness  of  each 
alternative.  Item-scoring  formulas  which  satisfy  these  conditions  are  called 
reproducing  scoring  systems  (RSS).  Shuford,  Albert,  and  Massengill  (1966)  and 
deFinetti  (1965)  provide  examples  of  several  RSSs.  The  RSSs  presented  by  these 
two  authors  for  use  with  multiple-choice  items  that  have  more  than  two  alterna¬ 
tives  and  only  one  correct  answer  are  the  following: 

1. Spherical RSS

$$\text{Item score} = \frac{p_c}{\sqrt{\sum_{k=1}^{m} p_k^2}} \qquad [2]$$

where $p_c$ = probability assigned to the correct alternative and
$p_k$ = probability assigned to alternative $k$, $k = (1, 2, \ldots, m)$.

2. Quadratic RSS

$$\text{Item score} = 2p_c - \sum_{k=1}^{m} p_k^2 \qquad [3]$$

3. Truncated Logarithmic Scoring System

$$\text{Item score} = \begin{cases} 1 + \log(p_c), & .01 < p_c \le 1.00 \\ -1, & 0 \le p_c \le .01 \end{cases} \qquad [4]$$

or a modification of this scoring function:

$$\text{Item score} = \begin{cases} [2 + \log(p_c)]/2, & .01 \le p_c \le 1.00 \\ 0, & 0 \le p_c < .01 \end{cases} \qquad [5]$$


The  truncated  logarithmic  scoring  system  is  technically  not  an  RSS,  but  it  does 
have  the  properties  of  an  RSS  for  probabilities  between  .027  and  .973.  Accord¬ 
ing  to  Shuford  et  al.  (1966),  when  examinees  believe  that  an  alternative  has  a 
probability  of  being  the  correct  answer  less  than  or  equal  to  .027,  their  score 
will  be  maximized  by  assigning  a  probability  of  zero  to  that  alternative.  Al¬ 
ternatively,  when  examinees  believe  that  an  alternative  has  a  probability  great¬ 
er  than  or  equal  to  .973,  their  expected  score  will  be  maximized  by  assigning  a 
probability  of  1.00  to  that  alternative.  Shuford  et  al.  (1966)  stated  that  "for 
extreme  values  of  (p^),  some  information  about  the  student's  degree-of-belief 

probabilities  is  lost,  but  from  the  point  of  view  of  applications,  the  loss  in 
accuracy  is  insignificant"  (p.  137).  Note  also  that  the  truncated  logarithmic 
scoring  function  is  the  only  one  of  the  scoring  formulas  that  is  dependent  only 
upon  the  probability  assigned  to  the  correct  alternative. 
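
The four item-scoring rules can be written compactly in Python. The sketch below follows the usual forms of the spherical, quadratic, and truncated logarithmic rules; the function names are hypothetical, and the logarithm is assumed to be base 10, which is the base that makes the two branches of each truncated rule meet at a probability of .01 (1 + log10(.01) = -1).

```python
import math


def spherical_score(probs, correct):
    """p_c divided by the Euclidean length of the response vector."""
    return probs[correct] / math.sqrt(sum(p * p for p in probs))


def quadratic_score(probs, correct):
    """2 * p_c minus the sum of squared probabilities."""
    return 2 * probs[correct] - sum(p * p for p in probs)


def truncated_log_score(probs, correct):
    """1 + log10(p_c) above .01, otherwise -1 (base 10 is an assumption)."""
    p_c = probs[correct]
    return 1 + math.log10(p_c) if p_c > 0.01 else -1.0


def modified_truncated_log_score(probs, correct):
    """[2 + log10(p_c)] / 2 at or above .01, otherwise 0."""
    p_c = probs[correct]
    return (2 + math.log10(p_c)) / 2 if p_c >= 0.01 else 0.0
```

With all probability on the correct alternative, every rule returns its maximum of 1. With a uniform response over four alternatives, the spherical rule returns .50 and the quadratic rule returns .25, which shows that the rules are on different scales even though they share a maximum.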

Total  test  scores  for  examinees  are  obtained  for  all  of  the  RSSs  by  simply 
summing  the  individual  item  scores  obtained  using  that  particular  scoring  formu¬ 
la.  In  addition  to  the  conditions  expressed  above  for  an  RSS,  deFinetti  (1965) 
has  stated  that  the  validity  of  any  reproducing  scoring  system  also  rests  upon 
the  following  assumptions: 
1. The examinees are capable of assigning numerical values to their
subjective probabilities.

2.  The  examinees  are  trained  in  using  the  response  format  and  understand 
the  scoring  system  to  be  used  in  scoring  the  items. 

3.  The  examinees  are  motivated  to  do  their  best  on  the  items. 
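
The defining property of a reproducing scoring system, that expected score is maximized by reporting one's actual beliefs, can be checked numerically for a single example. The sketch below is an illustration under stated assumptions, not a proof: the belief vector is hypothetical, the function names are introduced here, and the "linear" rule stands for scoring an item by the probability assigned to the correct alternative.

```python
import math


def spherical(response, correct):
    return response[correct] / math.sqrt(sum(p * p for p in response))


def quadratic(response, correct):
    return 2 * response[correct] - sum(p * p for p in response)


def linear(response, correct):
    # Probability assigned to the correct alternative (not a reproducing rule).
    return response[correct]


def expected_score(score_fn, beliefs, response):
    """Expected item score when alternative c is correct with probability beliefs[c]."""
    return sum(q * score_fn(response, c) for c, q in enumerate(beliefs))


beliefs = [0.6, 0.2, 0.1, 0.1]   # hypothetical examinee beliefs
honest = beliefs                 # report the beliefs as held
all_in = [1.0, 0.0, 0.0, 0.0]    # put everything on the favored alternative

for name, fn in [("spherical", spherical), ("quadratic", quadratic), ("linear", linear)]:
    print(name, expected_score(fn, beliefs, honest), expected_score(fn, beliefs, all_in))
```

For these beliefs the honest response has the higher expected score under the spherical rule (about .65 versus .60) and the quadratic rule (.42 versus .20), while under the linear rule the exaggerated response does better (.60 versus .42).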

Rippey  (1968)  reported  results  from  several  studies  comparing  test  scores 
obtained  using  the  spherical  RSS  and  the  modification  of  the  truncated  logarith¬ 
mic  scoring  functions  with  test  scores  obtained  by  summing  dichotomous  (0,1) 
item  scores  to  conventional  multiple-choice  items.  In  general,  he  found  in¬ 
creases  in  Hoyt's  reliability  coefficient  using  a  probabilistic  response  format 
with  RSSs  under  limited  conditions.  The  probabilistic  test  format  produced  in¬ 
creases  in  test  reliability  with  undergraduate  college  students  but  could  not  be 
used  with  fourth  graders  and  produced  no  consistent  increases  in  reliability  for 
tests  given  to  high  school  freshmen  or  medical  students.  There  were  also  no 
consistent  tendencies  for  one  or  the  other  of  the  scoring  formulas  for  the  prob¬ 
abilistic  response  format  to  produce  higher  reliability  coefficients. 

Rippey  (1970)  compared  the  reliabilities  of  five  different  methods  of  scor¬ 
ing  probabilistic  item  responses.  Three  of  these  methods  were  RSSs;  the  fourth 
was  simply  the  probability  assigned  to  the  correct  answer,  and  the  fifth  was  a 
dichotomous  scoring  of  the  probabilistic  responses,  which  resulted  in  an  item 
score  of  1  if  the  probability  assigned  to  the  correct  answer  was  greater  than 
the  probability  assigned  to  any  other  alternative  and  a  score  of  0  otherwise. 
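
The dichotomous scoring of a probabilistic response can be sketched as follows. The function name is hypothetical; the rule follows the description above, so a tie between the correct alternative and any other alternative scores 0.

```python
def dichotomous_score(probs, correct):
    """1 if the correct alternative received strictly more probability
    than every other alternative, else 0."""
    p_c = probs[correct]
    others = [p for k, p in enumerate(probs) if k != correct]
    return 1 if all(p_c > p for p in others) else 0
```

Under this rule the responses (.40, .30, .20, .10) and (.97, .01, .01, .01) earn the same score when the first alternative is correct, which is the information about degree of belief that the probabilistic scoring rules are meant to retain.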

The  three  RSSs  used  were  the  modification  of  the  truncated  log  scoring  function, 
the  spherical  RSS,  and  another  RSS  called  the  Euclidean  RSS.  An  item  score  us¬ 
ing  the  Euclidean  RSS  is  computed  using  the  following  equation: 


[6] 


where $p_k$ = probability assigned to alternative $k$, $k = (1, 2, \ldots, N)$, and
$X_k$ = criterion group mean probability assigned to alternative $k$.


Using Hoyt's reliability coefficient, Rippey found that the test scores
obtained by summing the probabilities assigned to the correct answer yielded
higher average reliability coefficients (.69) than any of the other scoring
methods and that the dichotomous scoring of the probabilistic responses yielded
the  lowest  average  reliability  of  the  five  methods  (.47),  although  it  was  not 
much  lower  than  those  of  the  three  RSSs  (.49,  .50,  and  .58). 
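
For readers unfamiliar with Hoyt's coefficient, it is usually computed from a persons-by-items analysis of variance as 1 minus the ratio of the residual mean square to the between-persons mean square, and it is algebraically equivalent to coefficient alpha. The sketch below is a minimal pure-Python illustration with a hypothetical function name; it is not the procedure used in any of the studies cited, whose computational details are not given here.

```python
def hoyt_reliability(scores):
    """Hoyt's ANOVA reliability for a persons-by-items matrix of item scores.

    scores: list of rows, one row per examinee, one column per item.
    Returns 1 - MS_residual / MS_persons.
    """
    n = len(scores)       # number of examinees
    k = len(scores[0])    # number of items
    grand = sum(sum(row) for row in scores) / (n * k)
    person_means = [sum(row) / k for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_residual = ss_total - ss_persons - ss_items

    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return 1 - ms_residual / ms_persons
```

Because the function accepts any numeric item scores, the same routine applies to dichotomous (0, 1) scores and to item scores produced by any of the probabilistic scoring formulas.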

In  comparing  two  RSSs  (quadratic  and  the  modification  of  the  truncated  log¬ 
arithmic  scoring  functions)  with  conventional  multiple-choice  test  scores, 
Koehler  (1971)  found  no  significant  differences  between  internal  consistency 
reliability  coefficients  for  the  test  scores  obtained  using  the  two  RSSs  and  the 
test  scores  from  the  conventional  multiple-choice  items.  He  found  evidence  of 
convergent  validity  for  both  the  probabilistic  and  conventional  item  formats 
and,  on  the  basis  of  this  evidence,  suggested  the  use  of  conventional  tests, 
since  they  are  "easier  to  administer,  take  less  testing  time,  and  do  not  require 
the  training  of  subjects  in  the  intricacies  of  the  confidence-marking  proce¬ 
dures"  (p.  302).  However,  his  conclusions  must  be  viewed  with  caution,  since 
each  of  his  tests  consisted  of  only  10  items. 
Extraneous Influences on the Use of

Subjective  Probabilities  with  Multiple-Choice  Items 


Although Koehler's results may not be generalizable due to the small number
of  items  administered  in  each  format,  the  use  of  the  probabilistic  item  format 
has  been  questioned  for  other  reasons.  Hansen  (1971),  Jacobs  (1971),  Slakter 
(1967),  Echternacht,  Boldt,  and  Sellman  (1972),  Koehler  (1974),  and  Pugh  and 
Brunza  (1974),  along  with  several  others,  have  investigated  the  possibility  that 
the  increase  in  reliability  demonstrated  by  probabilistic  item  formats  is  due  to 
the  effect  of  a  personality  variable  or  response  style  variable  rather  than  a 
more  accurate  assessment  of  knowledge.  This  variable  has  been  alternately 
called risk taking, certainty, confidence, and cautiousness. If it is the
effect of this response style variable that leads to increases in reliability for
probabilistic  responding  over  conventional  multiple-choice  items,  this  effect 
might  also  explain  the  fact  that  the  probabilistic  item  format  has  not,  in  gen¬ 
eral,  led  to  increases  in  the  validity  of  these  test  scores  over  that  of  test 
scores  obtained  from  conventional  multiple-choice  items. 

Studies  investigating  the  influence  of  these  various  personality  variables 
have  shown  mixed  results.  In  studies  where  conventional  multiple-choice  item 
scores and probabilistic item scores were obtained (Koehler, 1974; Echternacht,
Sellman, Boldt, & Young, 1971), the correlations between the two types of scores
have been consistently high (.71 to .83 for the Koehler (1974) study and .89 to
.99 for the Echternacht et al. (1971) study). This suggests that a large proportion
portion  of  the  variation  in  the  probabilistic  test  scores  can  be  accounted  for 
by  the  conventional  test  scores.  The  question  being  posed,  though,  is  whether 
the  variation  in  the  probabilistic  test  scores  that  cannot  be  accounted  for  by 
the  conventional  test  scores  is  reliable  variance  due  to  increased  accuracy  of 
assessment  of  knowledge  or  due  to  personality  or  response  style  variables. 
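
The step from a correlation to a "proportion of variation accounted for" is the squared correlation. The short sketch below, with a hypothetical function name, applies it to the endpoints of the two reported ranges.

```python
def shared_variance(r):
    """Proportion of variance in one score accounted for by the other."""
    return r * r


# Endpoints of the correlation ranges reported for the two studies.
for r in (0.71, 0.83, 0.89, 0.99):
    print(r, round(shared_variance(r), 2))
```

Squaring the endpoints gives roughly .50 to .69 for the Koehler (1974) correlations and .79 to .98 for the Echternacht et al. (1971) correlations, so between about 2% and 50% of the variance in the probabilistic scores is left to be explained by something other than the conventional score.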

To  determine  the  influence  of  these  personality  factors,  Koehler  (1974) 
embedded  seven  nonsense  items  in  a  40-item  vocabulary  test  and  told  examinees 
that  they  were  not  to  guess  the  answers  to  any  items  on  the  test.  The  nonsense 
items  were  items  with  no  correct  alternatives.  From  responses  to  these  nonsense 
items  he  calculated  two  confidence  measures: 

C I  =  proportion  of  nonsense  items  attempted  under  do-not-guess  instructions, 

and 


where  m  =  number  of  alternatives, 

n  =  number  of  nonsense  items,  and 
$p_{ij}$ = probability assigned to alternative $i$ on item $j$.

Since  the  nonsense  items  had  no  correct  alternatives,  an  examinee's  respon¬ 
ses  to  these  items  were  a  pure  measure  of  a  response  style  or  personality  vari¬ 
able  (confidence)  that  was  influencing  that  examinee's  responses.  Responses  to 
these  items  were  not  due  to  any  knowledge  the  examinee  possessed,  since  there 
were  no  correct  answers  to  those  items.  The  greater  the  deviation  of  these  in¬ 
dices  from  0,  the  higher  the  level  of  confidence  exhibited  by  the  examinee. 


Koehler  found  that  both  of  these  confidence  indices  were  significantly  negative¬ 
ly  correlated  with  three  probabilistic  test  scores  (spherical,  quadratic,  and 
the  modification  of  the  truncated  logarithmic  scoring  functions),  but  not  sig¬ 
nificantly  correlated  with  the  number-correct  scores  from  the  same  items.  The 
number-correct  scores  also  yielded  a  higher  internal  consistency  reliability 
coefficient  than  the  three  probabilistic  scores  (.85  versus  .82,  .80,  and  .74). 
On  the  basis  of  these  results,  Koehler  did  not  recommend  the  use  of  probabilis¬ 
tic  response  formats,  since  "it  would  appear  ...  that  confidence  responding 
methods  produce  variability  in  scores  that  cannot  be  attributed  to  knowledge  of 
subject  matter"  (p.  4). 

Hansen  (1971)  obtained  probabilistic  test  scores  and  scores  on  independent 
measures  of  personality  factors  such  as  risk  taking  and  test  anxiety.  He  devel¬ 
oped  a  measure  of  certainty  in  responding  to  probabilistic  response  formats 
which  is  essentially  the  average  absolute  deviation  of  a  response  vector  to  an 
item  from  a  response  vector  assigning  equal  probabilities  to  all  alternatives. 
Hansen's  study  showed  that  this  certainty  index  was  related  to  risk  taking  as 
measured  by  the  Kogan  and  Wallach  Choice  Dilemmas  Questionnaire  and  authoritari¬ 
anism  as  measured  by  a  version  of  the  F-scale,  developed  by  Christie,  Havel,  and 
Seidenberg  (1958).  However,  the  certainty  index  did  not  correlate  significantly 
with scores on a test anxiety questionnaire or scores on the Gough-Sanford
Rigidity Scale.

These  results  provide  more  information  concerning  the  nature  of  the  re¬ 
sponse  style,  but  there  are  problems  with  Hansen’s  (1971)  certainty  index,  which 
he  attempts  to  alleviate  but  does  not.  The  major  problem  with  this  index  is 
that  it  is  not  a  pure  measure  of  certainty.  This  certainty  measure  is  con¬ 
founded  by  an  examinee's  knowledge  concerning  an  item.  Hansen  attempted  to  par¬ 
tial  out  examinees'  knowledge  by  using  their  test  scores  as  a  predictor  in  a 
regression  equation  to  obtain  predicted  certainty  scores.  These  predicted  cer¬ 
tainty  scores  were  then  subtracted  from  the  observed  certainty  scores  to  obtain 
a  certainty  measure  free  of  the  influence  of  examinee  knowledge. 
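
The regression step described above amounts to taking residuals from a simple linear regression of certainty on a knowledge measure. The sketch below is a generic illustration using ordinary least squares with a single predictor; the function name and the data in the usage note are hypothetical and do not come from any of the studies discussed.

```python
def residual_scores(predictor, outcome):
    """Residuals from the least-squares regression of outcome on predictor.

    With a knowledge score as predictor and a certainty index as outcome,
    each residual is observed certainty minus predicted certainty.
    """
    n = len(predictor)
    mean_x = sum(predictor) / n
    mean_y = sum(outcome) / n
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, outcome))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(predictor, outcome)]
```

For example, `residual_scores([10, 14, 18, 22, 26], [0.10, 0.20, 0.15, 0.30, 0.25])` returns residuals that sum to zero and are uncorrelated with the predictor. The residuals are free of only what the predictor measures, so the purity of the resulting certainty measure depends entirely on the purity of the knowledge score used as the predictor.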

Although  the  rationale  is  sound,  Hansen  did  not  accomplish  what  he  set  out 
to  do.  The  test  score  he  used  as  a  predictor  was  not  a  pure  or  even  relatively 
pure measure of knowledge. The test scores were probabilistic test scores computed
from the spherical RSS. This scoring system results in scores that represent
a confounding of certainty and knowledge. Therefore, by partialling these
probabilistic  test  scores  from  the  certainty  index,  it  is  unclear  exactly  what 
the  residual  certainty  index  represents,  since  both  knowledge  and  some  certainty 
have  been  partialled  out.  Hansen's  results  were  then  based  upon  the  relation¬ 
ship  of  various  personality  variables  with  a  certainty  index  confounded  with 
knowledge,  and  the  relationship  of  these  same  personality  variables  with  a  re¬ 
sidual  certainty  index  whose  composition  is  somewhat  ambiguous.  Hansen's  re¬ 
sults  might  best  be  viewed  with  caution. 

Pugh  and  Brunza  (1974)  conducted  a  study  similar  to  that  of  Hansen  (1971), 
except  that  they  used  a  24-item  vocabulary  test  and  scored  it  using  the  proba¬ 
bility  assigned  to  the  correct  answer  as  the  item  score.  They  also  obtained 
scores on an independent nonprobabilistically scored vocabulary test, and
measures of risk taking, degree of external control, and cautiousness. They
followed Hansen's regression procedure to obtain a certainty measure free of the
confounding effects of knowledge and were more successful than Hansen. They
used the independent vocabulary test score as a predictor of the same certainty
index  that  Hansen  used  and  then  calculated  a  residual  certainty  index  by  sub¬ 
tracting  the  predicted  certainty  score  from  the  observed  certainty  score.  Since 
the  independent  vocabulary  test  was  a  relatively  pure  measure  of  knowledge,  par- 
tialling  its  effect  from  the  observed  certainty  index  resulted  in  a  residual 
certainty  index  that  (1)  was  a  measure  of  the  certainty  displayed  in  responding 
to  multiple-choice  items  in  a  probabilistic  fashion  and  (2)  was  not  related  to 
knowledge  possessed  by  examinees  concerning  the  items. 

Pugh  and  Brunza  (1974)  reported  that  this  residual  certainty  measure  was 
not  very  reliable  (.32  internal  consistency  reliability)  and  that  it  correlated 
significantly  with  risk-taking  scores  obtained  from  the  Kogan  and  Wallach  Choice 
Dilemmas  Questionnaire  but  not  with  the  measures  of  cautiousness  and  external 
control  they  had  obtained.  Although  this  evidence  of  the  influence  of  variables 
other  than  knowledge  on  probabilistic  test  scores  might  serve  as  a  deterrent  to 
the use of these scoring systems, Pugh and Brunza noted that "there is no
evidence in either study [Pugh & Brunza, 1974, or Hansen, 1971] that these factors
are  more  operative  than  in  traditional  tests"  (p.  6). 

Echternacht et al. (1971) scored answer sheets of daily quizzes obtained
from  two  Air  Force  training  courses  using  a  truncated  logarithmic  scoring  func¬ 
tion  and  number  correct.  They  found  that  using  the  number-correct  score,  the 
shift  of  the  trainees,  and  a  number  of  personality  variables  such  as  test  anxie¬ 
ty,  risk  taking,  and  rigidity  as  predictors  of  the  probabilistic  test  scores  did 
not  account  for  significantly  more  of  the  variation  in  the  probabilistic  test 
scores than was accounted for when using only number-correct scores and shift of
the  trainees  as  predictors.  This  is  evidence  that  the  personality  variables  did 
not  operate  to  a  greater  extent  in  a  probabilistic  testing  situation  than  in  a 
conventional  multiple-choice  testing  situation. 

Thus,  these  studies  show  some  relationship  of  probabilistic  test  scores  to 
personality  variables  (primarily  risk-taking  tendencies);  but  they  also  show 
that  these  influences  do  not  seem  to  be  greater  in  probabilistic  testing  situa¬ 
tions  than  in  conventional  testing  situations. 

Use  of  Alternate  Item  Types 

The  research  reviewed  above  relied  on  the  multiple-choice  item  type  and 
varied  the  method  of  responding  to  that  type  of  item;  however,  some  researchers 
have  advocated  the  use  of  entirely  different  item  types,  such  as  free-response 
items,  to  aid  in  the  assessment  of  partial  knowledge.  Some  of  these  alternate 
item  types  avoid  many  of  the  problems  inherent  in  multiple-choice  items  but  are 
subject  to  problems  of  their  own.  For  example,  the  free-response  item  type 
avoids  the  problem  of  random  guessing  among  a  number  of  alternatives  and  has  the 
potential  to  provide  a  large  amount  of  information  concerning  what  the  examinee 
does  or  does  not  know,  but  it  is  also  more  time-consuming  to  administer  and 
score,  and  may  cover  much  less  material  than  is  possible  with  a  multiple-choice 
format.  Consequently,  if  there  are  any  time  constraints  on  testing,  fewer  items 
can  be  administered.  Practical  problems  with  scoring  many  of  these  alternate 
item  types  have  prevented  widespread  use  of  several  of  them. 


Although  comparisons  of  the  psychometric  properties  of  multiple-choice 
items  with  several  alternate  item  types  are  planned,  the  present  research  fo¬ 
cused  on  comparisons  of  the  probabilistic  response  formats.  This  study  has  at¬ 
tempted  to  answer  the  following  questions: 

1.  Does  a  personality  variable  such  as  certainty  affect  probabilistic  test 
scores  on  an  ability  test  to  a  greater  degree  than  it  affects  conven¬ 
tional  test  scores  on  the  same  ability  test? 

2.  If  the  effect  of  a  personality  variable  can  be  discounted,  what  types 
of  scoring  systems  are  best  for  multiple-choice  items  on  an  ability 
test  requiring  probabilistic  responses? 


Method 

Test  Items 

Thirty  multiple-choice  analogy  items  were  chosen  from  a  pool  of  items  ob¬ 
tained  from  Educational  Testing  Service  (ETS)  containing  former  SCAT  and  STEP 
items.  Each  item  consisted  of  an  item  stem  and  four  alternatives.  The  pool  of 
items  had  been  parameterized  by  ETS  on  groups  of  high  school  students  using  the 
computer  program  LOGIST  (Wood,  Wingersky,  &  Lord,  1976)  with  a  three-parameter 
logistic model, resulting in item response theory discrimination, difficulty,
and  guessing  parameters  calculated  from  large  numbers  of  examinees  for  each 
item.  The  30  items  were  chosen  from  a  pool  of  approximately  300  analogy  items 
to  represent  a  uniform  range  of  discrimination  and  difficulty  parameters.  The 
parameters for the chosen items are in Appendix Table A. The item discrimination
parameters ranged from approximately a = .6 to a = 1.4, with a mean of .975
and a standard deviation of .244, while the difficulty parameters ranged from
approximately b = -.5 to b = 2.5, with a mean of .961 and a standard deviation
of .887. The range of difficulty parameters was not chosen to be symmetric
about zero because the available examinees constituted a more select group than
the group whose responses were used to parameterize the items. The guessing
parameters for these items ranged from c = .09 to c = .38, with a mean of .20
and  a  standard  deviation  of  .06. 

Test  Administration 

The  30  multiple-choice  analogy  items  chosen  were  then  administered  to  299 
psychology  and  biology  undergraduate  students  at  the  University  of  Minnesota 
during  the  1979-1980  academic  year.  Students  received  two  points  toward  their 
course  grade  (either  introductory  psychology  or  biology)  for  their  partici¬ 
pation.  Items  were  administered  by  computer  to  permit  checking  of  responses  to 
be  sure  that  item  response  instructions  were  carefully  followed. 

The  examinees  were  instructed  to  respond  to  each  item  by  assigning  a  proba¬ 
bility  to  each  of  the  four  alternatives.  This  probability  was  to  correspond  to 
the examinee's belief in the correctness of each alternative, with the additional
restriction that the probabilities assigned to all of the alternatives for an
item sum to one. Specifically, for each item, the examinees were asked to
distribute 100 points among the four alternatives provided for each item according
to their belief as to whether or not the alternative was the correct alternative
for that item. The total number of points assigned to all of the alternatives
for an item had to equal 100. Since the tests were computer administered, item
responses  were  summed  immediately  to  ensure  that  the  responses  to  the  alterna¬ 
tives  did  indeed  sum  to  100  (sums  of  99  and  101  were  also  considered  valid  to 
allow  for  rounding).  The  points  assigned  to  each  alternative  were  then  con¬ 
verted  into  probabilities  by  dividing  the  response  to  each  alternative  by  100. 
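
The response check and conversion described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `tolerance` parameter, and the use of an exception to signal an invalid response are hypothetical and are not taken from the original testing program.

```python
def points_to_probabilities(points, tolerance=1):
    """Convert one item's point allocation into probabilities.

    Sums of 99, 100, or 101 are accepted, as described in the text.
    The function name, the tolerance parameter, and the ValueError are
    illustrative assumptions, not details of the original program.
    """
    if any(p < 0 for p in points):
        raise ValueError("points must be non-negative")
    total = sum(points)
    if abs(total - 100) > tolerance:
        raise ValueError(f"points sum to {total}; expected 99, 100, or 101")
    # The text divides each response by 100, not by the observed sum.
    return [p / 100 for p in points]
```

For example, the allocation 20, 60, 20, 0 becomes the probabilities .20, .60, .20, 0, while an allocation summing to 150 would be rejected and re-entered.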

To ensure that the examinees understood both how to use the computer and
how to respond to the multiple-choice items in a probabilistic fashion, a
detailed set of instructions preceded each test (see Appendix Table B). If an
examinee  responded  incorrectly  to  an  instruction,  the  computer  would  display  an 
appropriate  error  message  on  the  CRT  screen  and  the  examinee  would  have  to  re¬ 
spond  correctly  before  proceeding  to  the  next  screen.  If  an  examinee  again  re¬ 
sponded  inappropriately  to  an  instruction,  a  test  proctor  was  called  by  the  com¬ 
puter  to  provide  additional  help  to  the  examinee  in  understanding  the  instruc¬ 
tions.  Several  examples  and  explanations  of  methods  of  responding  to  probabi¬ 
listic  items  were  provided.  Examinees,  with  few  exceptions,  did  not  have  any 
difficulty  understanding  how  to  respond  to  the  items.  If,  in  responding  to  an 
item,  an  examinee's  responses  did  not  sum  to  99,  100,  or  101,  the  examinee  was 
immediately  asked  to  reenter  his/her  responses  until  an  appropriate  sum  was  en¬ 
tered. 


The  item  responses  obtained  from  these  299  examinees  were  then  scored  using 
five  different  scoring  formulas  to  determine  which  of  these  scoring  formulas 
yielded  the  most  reliable  and  valid  scores.  The  five  different  scoring  formulas 
used  were: 

1.  The  probability  assigned  to  the  correct  alternative  by  the  examinee 
(PACA) was used as the item score. This scoring formula yields scores
that  range  from  0  to  1.00. 

2.  The  second  type  of  item  score  (AIKEN)  was  computed  from  a  variation  of  a 
scoring  formula  developed  by  Aiken  (1970),  which  is  a  function  of  the 
absolute  difference  between  the  correct  response  vector  for  an  item  and 
the  obtained  response  vector: 


$$\text{Item score} = 1 - \frac{D}{D_{\max}} \qquad [8]$$

where

$$D = \sum_{i=1}^{m} \left| Pa_i - Pe_i \right| \qquad [9]$$

$m$ = number of alternatives,
$Pa_i$ = probability assigned to alternative $i$ by the examinee;
$Pe_i$ = expected probability for alternative $i$; and
$D_{\max}$ = maximum value of $D$, which was 2.00 for all of these items.

Each correct response vector would contain three 0's and one 1, while
the  obtained  response  vector  would  contain  four  probabilities  that  sum 
to  1.00.  For  example,  for  an  item  where  the  second  alternative  was  the 
correct  alternative,  the  correct  response  vector  would  be  0,  1.00,  0,  0. 
A  response  vector  that  might  have  been  obtained  for  this  item  is  .20, 
.60,  .20,  0.  For  this  obtained  response  vector  the  item  score  would  be 
computed  as  follows: 

$$\text{Item score} = 1 - \frac{|.20 - 0| + |.60 - 1.00| + |.20 - 0| + |0 - 0|}{2.00} = 1 - \frac{.80}{2.00} = .60 \qquad [10]$$


This  scoring  formula  also  yields  scores  that  range  from  0  to  1.00. 

3. The quadratic RSS (QUAD) is defined by Equation 3. This scoring formula
yields scores that range from -1.00 to 1.00.

4.  The  spherical  RSS  (SPHER)  is  defined  in  Equation  2.  This  scoring  formu¬ 
la  yields  scores  that  range  from  0  to  1.00. 

5. A modification of the truncated logarithmic scoring function (TLOG).
This scoring formula, defined by Equation 5, is a very good approximation
to the logarithmic RSS throughout most of the possible score range, and it
yields scores from 0 to 1.00. The actual formula used here to obtain scores
via a truncated logarithmic scoring function utilizes a scaling factor of 5
rather than the usual scaling factor of 1 or 2. It was necessary to
increase this scaling factor to maintain a logical progression of
scores, since the probability assigned to the correct answer for some
items was as low as .01. Since the natural log of .01 is -4.6052, the
scaling factor had to be 5 (actually, any number slightly higher than
4.6052 would do) in order that the scores progress in an orderly fashion
from 0 to 1.00 according to the probability assigned to the correct answer.
This alleviated the problem of assigning negative scores to examinees
who had assigned very small probabilities to the correct answer while
assigning a score of 0 (a higher score) to examinees who had assigned a
zero probability to the correct answer. The actual TLOG scoring formula
used is Equation 11.

$$\text{Item score} = \begin{cases} \dfrac{5 + \log(p_c)}{5}, & .01 \le p_c \le 1.00 \\[4pt] 0, & 0 \le p_c < .01 \end{cases} \qquad [11]$$


Total test scores for all of the scoring methods were obtained by summing the
item scores over the 30 items.
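
The five item-scoring rules can be sketched in a few lines of Python. PACA, AIKEN, and TLOG follow the definitions given in this section. Equations 2 and 3 are not reproduced in this section, so the `spher` and `quad` functions below use the usual textbook forms of the spherical and quadratic rules; that is an assumption, although both forms give the score ranges stated above. All function names are hypothetical.

```python
import math

def paca(probs, correct):
    """Probability assigned to the correct alternative (method 1)."""
    return probs[correct]

def aiken(probs, correct, d_max=2.0):
    """Equations 8 and 9: 1 - D / D_max, D = sum of |assigned - expected|."""
    d = sum(abs(p - (1.0 if i == correct else 0.0))
            for i, p in enumerate(probs))
    return 1.0 - d / d_max

def quad(probs, correct):
    """ASSUMED textbook quadratic rule: 2 * p_c - sum of squared p."""
    return 2.0 * probs[correct] - sum(p * p for p in probs)

def spher(probs, correct):
    """ASSUMED textbook spherical rule: p_c / sqrt(sum of squared p)."""
    return probs[correct] / math.sqrt(sum(p * p for p in probs))

def tlog(probs, correct, scale=5.0, floor=0.01):
    """Equation 11: (5 + ln p_c) / 5 at or above .01, otherwise 0."""
    p_c = probs[correct]
    if p_c < floor:
        return 0.0
    return (scale + math.log(p_c)) / scale

def total_score(responses, key, rule):
    """Sum one scoring rule over all items of a test."""
    return sum(rule(probs, correct) for probs, correct in zip(responses, key))
```

With the example response vector .20, .60, .20, 0 and the second alternative keyed correct, `aiken` reproduces the item score of .60 worked out in Equation 10.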


Determining  the  Effect  of  Certainty 


To determine the effect of an examinee's certainty or propensity to take
risks when responding to probabilistic items, Hansen's (1971) certainty index
was computed for each examinee using the following formula:


where

C = certainty index,
n = number of items in test,
m_j = number of alternatives for item j, and
p_ij = probability assigned to alternative i of item j.

This  certainty  index  is  a  function  of  the  absolute  difference  between  the  proba¬ 
bilities  assigned  to  the  four  alternatives  and  .25,  averaged  over  items.  Since 
the  probabilities  assigned  to  each  alternative  are  dependent  upon  both  an  exam¬ 
inee's  knowledge  and  his/her  level  of  certainty,  this  certainty  index  is  not  a 
"pure"  measure  of  certainty,  but  is  confounded  with  knowledge  about  the  item. 

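The verbal description of the certainty index can be illustrated with a short sketch. It assumes the simplest reading of that description: the summed absolute deviation of each assigned probability from the chance value 1/m, averaged over items. Any normalizing constant in Hansen's (1971) original formula is not reproduced here, so values from this hypothetical function may differ from the published index by a constant factor.

```python
def certainty_index(responses):
    """Mean over items of the summed |p - 1/m| for each item.

    `responses` is a list of per-item probability lists. This is an
    ASSUMED, unnormalized form of the index, not Hansen's exact formula.
    """
    total = 0.0
    for probs in responses:
        m = len(probs)
        total += sum(abs(p - 1.0 / m) for p in probs)
    return total / len(responses)
```

Under this reading, an examinee who always spreads the points evenly (25 to each of four alternatives) obtains an index of 0, and one who always puts all 100 points on a single alternative obtains the maximum, whether or not the chosen alternative is correct.
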
To determine the effect of this response style variable, it was first
necessary to obtain a "pure" measure of certainty. This relatively pure measure of
certainty was obtained by scoring the probabilistic responses dichotomously and
then partialling the effect of this knowledge variable out of the certainty
indices. A dichotomous test score was obtained from the probabilistic responses
by making the assumption that under conventional "choose-the-correct-answer"
instructions,  examinees  would  choose  the  alternative  to  which  they  assigned  the 
highest  probability  under  the  probabilistic  instructions.  Thus,  for  each  item, 
the  alternative  assigned  the  highest  probability  by  the  examinee  was  chosen  as 
the  alternative  the  examinee  would  have  chosen  under  traditional  multiple-choice 
instructions.  A  score  of  1  was  assigned  if  that  alternative  was  the  correct 
answer  and  a  score  of  0  was  assigned  otherwise.  When  more  than  one  alternative 
was  assigned  the  highest  probability,  one  of  those  alternatives  was  randomly 
chosen  as  the  alternative  the  examinee  would  have  chosen.  This  procedure  at¬ 
tempted  to  simulate  the  decision-making  process  of  an  examinee  in  choosing  a 
correct  answer  to  an  item. 
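
A minimal sketch of this dichotomous scoring rule, including the random tie-break, is given below. The function name and the optional `rng` argument (which allows a seeded generator for reproducibility) are illustrative assumptions.

```python
import random

def dichotomous_item_score(probs, correct, rng=random):
    """Score 1 if the highest-probability alternative is the keyed answer.

    When several alternatives share the highest probability, one of them
    is picked at random, as described in the text. The function name and
    the rng argument are illustrative assumptions.
    """
    top = max(probs)
    candidates = [i for i, p in enumerate(probs) if p == top]
    choice = rng.choice(candidates)
    return 1 if choice == correct else 0
```

For the response vector .20, .60, .20, 0 the second alternative is always selected, so the item is scored 1 only when the second alternative is keyed correct.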

This  dichotomous  test  score  was  used  in  a  regression  equation  to  predict 
the  certainty  index.  The  predicted  certainty  index  was  then  subtracted  from  the 
actual certainty index to obtain a residual certainty index. This residual
certainty index constituted a "pure" measure of certainty. This pure certainty
index was partialled out of the probabilistic test scores using the same method
as that used to partial the dichotomous test scores out of the original
certainty index: the pure certainty index was used to predict the probabilistic
test score, and the predicted probabilistic test score was then subtracted from
the probabilistic test score to obtain a residual probabilistic test score that
was unassociated with the pure certainty index.
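
The partialling step is ordinary simple-regression residualization, which can be sketched as follows. The helper names `residualize` and `pearson` are hypothetical, and a single-predictor least-squares fit is assumed because the text describes one predictor in each step.

```python
import math

def pearson(a, b):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def residualize(y, x):
    """Return y minus its least-squares prediction from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
```

By construction the residuals correlate zero with the predictor, and their correlation with the original variable equals the square root of one minus the squared predictor correlation. With y as the certainty index and x as the dichotomous score, `residualize(y, x)` would correspond to the residual certainty index.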

As  a  result  of  these  partialling  operations,  the  following  measures  were 
available  for  each  of  the  five  scoring  methods: 

1.  Probabilistic  test  score.  This  score  represents  a  confounding  of  knowl¬ 
edge  and  certainty. 

2. Dichotomous test score. This score represents a pure knowledge index
and is the dichotomous scoring of the probabilistic responses.

3.  Residual  score.  This  score  is  the  probabilistic  test  score  with  the 
pure  certainty  index  partialled  out,  and  thus  represents  the  pure  knowl¬ 
edge  component  of  the  probabilistic  scores. 

4.  Certainty  index.  This  measure  represents  a  confounding  of  knowledge  and 
certainty. 

5.  Residual  certainty  index.  This  measure  is  the  certainty  index  with  the 
pure  knowledge  index  (the  dichotomous  test  score)  partialled  out  and 
thus  represents  a  pure  certainty  index. 

Evaluative  Criteria 

Reliability  and  validity  coefficients  were  computed  for  both  the  probabi¬ 
listic  and  the  residual  test  scores.  The  reliability  coefficients  were  internal 
consistency  reliability  coefficients  calculated  using  coefficient  alpha.  The 
validity  coefficients  were  the  correlations  between  test  score  and  reported 
grade-point average. For each of the five scoring methods used, the validity
and reliability of the residual scores were compared with those of the original
probabilistic test scores. Any difference between the validities and the
reliabilities of the probabilistic and the residual scores could be attributed
to the effect of certainty in responding, since the only difference between
the two scores was that the effect of certainty had been removed
from the residual scores.
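
For reference, coefficient alpha can be computed from an examinee-by-item score matrix as sketched below. The function name and the list-of-rows data layout are assumptions; the formula itself is the standard one, k/(k-1) times one minus the ratio of summed item variances to total-score variance.

```python
def coefficient_alpha(item_scores):
    """Coefficient alpha for a list of examinee rows of k item scores."""
    k = len(item_scores[0])

    def variance(values):
        # Unbiased sample variance (n - 1 in the denominator).
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_variances = [variance([row[j] for row in item_scores])
                      for j in range(k)]
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1.0 - sum(item_variances) / total_variance)
```

Because alpha depends only on the item variances and covariances, it can be applied unchanged to item scores from any of the scoring formulas.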

Factor  analyses  of  the  item  scores  (both  probabilistic  and  residual)  for 
each  of  the  five  scoring  formulas  were  performed  using  a  principal  axis  factor 
extraction  method.  The  number  of  factors  extracted  for  each  of  the  scoring  for¬ 
mulas  was  determined  through  parallel  analyses  (Horn,  1965)  performed  separately 
for  each  scoring  formula,  using  randomly  generated  data  with  the  same  numbers  of 
items  and  examinees  as  the  real  data  and  with  item  difficulties  (proportion  cor¬ 
rect)  equated  with  the  real  data.  Coefficients  of  congruence  and  correlations 
between  factor  loadings  for  each  of  the  five  scoring  formulas  were  computed. 


Results 

Score  Intercorrelations 

Correlations  between  probabilistic  test  scores,  residual  test  scores,  di¬ 
chotomous  scores,  the  certainty  index,  and  the  residual  certainty  index  for  each 
of the scoring formulas are presented in Table 1. Since the AIKEN scoring
formula resulted in item scores and correlations that were identical to those of
the PACA scoring formula, only the PACA results are reported.
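
The identity of the AIKEN and PACA item scores follows from Equations 8 and 9. The short derivation below is a sketch that assumes the assigned probabilities for an item sum exactly to 1.00 (responses summing to 99 or 101 points would shift the AIKEN score by a small constant); $c$ denotes the correct alternative, a symbol introduced here for illustration.

```latex
\begin{align*}
D &= |Pa_c - 1| + \sum_{i \neq c} |Pa_i - 0|
   = (1 - Pa_c) + (1 - Pa_c) = 2\,(1 - Pa_c), \\
\text{Item score} &= 1 - \frac{D}{D_{\max}}
   = 1 - \frac{2\,(1 - Pa_c)}{2.00} = Pa_c .
\end{align*}
```

The AIKEN item score therefore reduces to the probability assigned to the correct alternative, which is the PACA item score.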

As  expected,  due  to  the  partialling  procedure,  the  correlation  between  the 
residual  certainty  index  and  the  dichotomous  score,  and  the  correlation  between 
the  residual  certainty  index  and  the  residual  score,  were  both  zero  for  all 
scoring  methods.  The  correlation  between  the  original  certainty  index  and  the 
dichotomous  score  (.71),  and  the  correlation  between  the  original  certainty  in¬ 
dex  and  the  residual  certainty  index  (.71),  were  exactly  the  same  for  all  four 
scoring formulas. This is due to the fact that the three indices (the original
certainty index, the residual certainty index, and the dichotomous score) do not
change with the particular scoring formula used; they are constant for each
individual across scoring methods. These two significant correlations, along with
the significant correlations exhibited for each of the scoring formulas between
the certainty index and the residual score (.65, .67, .62, and .68 for QUAD,
SPHER, TLOG, and PACA, respectively), show that the original certainty index is
indeed related to both "knowledge" as measured by traditional multiple-choice
tests (the dichotomous scores) and "certainty" unconfounded with "knowledge"
(the residual certainty index).

Table 1

Intercorrelations of Scores for Multiple-Choice Items with a
Probabilistic Response Format Scored by Four Scoring Methods

Scoring Method         Probabi-   Dichot-               Residual    Residual
and Score              listic     omous     Certainty   Certainty   Score

Quadratic RSS (lower triangle) and Spherical RSS (upper triangle)
  Probabilistic          --        .94**     .64**       -.04       1.00**
  Dichotomous           .91**       --       .71**        .00        .94**
  Certainty             .56**      .71**      --          .71**      .67**
  Residual Certainty   -.12*       .00       .71**        --        -.00
  Residual Score        .99**      .92**     .65**        .00         --

Truncated Log RSS (lower triangle) and PACA (upper triangle)
  Probabilistic          --        .93**     .83**        .24**      .97**
  Dichotomous           .85**       --       .71**        .00        .96**
  Certainty             .43**      .71**      --          .71**      .68**
  Residual Certainty   -.25**      .00       .71**        --        -.00
  Residual Score        .97**      .88**     .62**        .00         --

*p < .05
**p < .01

The  correlations  between  the  probabilistic  test  scores  and  the  dichotomous 
test  scores  were  .91,  .94,  .85,  and  .93  for  the  QUAD,  SPHER,  TLOG,  and  PACA 
scoring methods, respectively. Using approximate significance tests for
correlations obtained from dependent samples (Johnson & Jackson, 1959, pp. 352-358),
all of the pairwise differences among these correlations were significant at
the .05 level. In practical terms, however, the only correlation of these four
that appears different from the others is that of TLOG (.85 as opposed to .91,
.94, and .93 for the other scoring methods). Squaring these four correlations
yields the proportion of variance in the probabilistic test scores accounted
for by the dichotomous test scores. The squared correlations are .83, .88, .72,
and .86 for the QUAD, SPHER, TLOG, and PACA scoring procedures, respectively.
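
The squared correlations quoted above can be checked with a few lines of arithmetic. The dictionary names are illustrative; the correlations are those reported in Table 1.

```python
# Correlations between probabilistic and dichotomous test scores (Table 1).
correlations = {"QUAD": 0.91, "SPHER": 0.94, "TLOG": 0.85, "PACA": 0.93}

# Proportion of variance shared, rounded to two decimals as in the text.
shared_variance = {method: round(r * r, 2)
                   for method, r in correlations.items()}
```

The result is .83, .88, .72, and .86, so TLOG shares noticeably less variance with the dichotomous score than the other three methods.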

The correlations between the residual certainty index (the "pure" certainty
measure) and the probabilistic test scores were -.12, -.04, -.25, and .24 for
the QUAD, SPHER, TLOG, and PACA scoring formulas, respectively. The
correlations for the QUAD and SPHER scoring formulas were not significantly different
from zero at the .01 level of significance and thus do not account for
significant amounts of the variance of the probabilistic test scores. Squaring the
correlations that are significantly different from zero results in squared
correlations of .06 for both the TLOG and PACA scoring formulas. Thus, certainty
as measured by the residual certainty index accounts for no more than 6% of the
variance of any of the probabilistic test scores.


The  correlations  in  Table  1  between  the  probabilistic  test  scores  and  the 
residual  scores  are  very  high  for  all  four  scoring  formulas  (.99,  1.00,  .97,  and 
.97,  for  QUAD,  SPHER,  TLOG,  and  PACA,  respectively).  These  correlations  are 
highest (.99 and 1.00) for the QUAD and SPHER scoring formulas, whose
correlations between the probabilistic test score and residual certainty index were not
significantly different from zero (-.12 and -.04); these correlations squared
(.98 and 1.00) show that almost all of the variance in the QUAD probabilistic
test scores, and all of the variance of the SPHER probabilistic test scores, is
accounted for by the residual scores (representing "knowledge" concerning the
items).

The  correlations  between  the  dichotomous  test  scores  and  the  residual 
scores  are  high  and  significantly  different  from  zero  for  all  of  the  scoring 
formulas (.92, .94, .88, and .96 for QUAD, SPHER, TLOG, and PACA scoring
formulas, respectively). This result is expected, since both the residual scores and
the  dichotomous  scores  are  relatively  pure  measures  of  knowledge. 

It  was  also  expected  that  the  correlations  between  the  original  certainty 
index  and  the  probabilistic  test  scores  for  the  various  scoring  methods  would  be 
greater  than  the  correlations  between  this  certainty  index  and  the  dichotomous 
scores,  since  the  probabilistic  test  scores  and  the  original  certainty  index 
both  represent  a  confounding  of  certainty  and  knowledge,  while  the  dichotomous 
scores  are  a  measure  of  knowledge  less  confounded  by  certainty.  This  occurred 
only  for  the  PACA  scoring  method,  which  was  the  only  scoring  method  that  was  not 
an  RSS.  The  correlation  between  the  certainty  index  and  probabilistic  test 
score  was  significantly  greater  than  the  correlation  between  the  dichotomous 
score and the certainty index (.83 vs. .71) for the PACA scoring formula, and was
significantly  less  (using  the  dependent  samples  test  of  significance  for  corre¬ 
lations  and  a  .05  level  of  significance)  than  .71  (.56,  .64  and  .43)  for  the 
other  three  scoring  formulas. 

Validity  and  Reliability 

Table  2  shows  the  validity  and  internal  consistency  reliability  coeffi¬ 
cients  for  the  probabilistic  test  scores  obtained  from  the  various  methods  of 
scoring  the  multiple-choice  items  with  a  probabilistic  response  format.  The 
validity  coefficients  were  all  significantly  different  from  zero  but  were  not 
significantly  different  from  each  other,  using  a  dependent  samples  test  of  sig¬ 
nificance  for  correlation  coefficients  (Johnson  &  Jackson,  1959,  pp.  352-358) 
and  maintaining  the  experimentwise  error  at  a  .01  alpha  level. 

The  reliability  coefficients  were  all  significantly  different  from  zero  and 
significantly  different  from  each  other  (using  the  Pitman  procedure  described  in 
Feldt,  1980,  for  testing  the  significance  of  differences  between  coefficient 
alpha  for  dependent  samples  using  a  .01  significance  level).  The  PACA  scoring 
method  yielded  the  highest  internal  consistency  reliability  (.91)  followed  by 
SPHER (.88), QUAD (.87), and TLOG (.84).

Table 2

Validity Correlations of Test Scores with
Reported GPA and Alpha Internal Consistency
Reliability Coefficients for Multiple-Choice Items
with a Probabilistic Response Format (N = 299)

                          Validity           Reliability
Scoring Method            r       p*         alpha    p*

Unpartialled Scores
  Quadratic RSS          .18     <.001       .87     <.001
  Spherical RSS          .18     <.001       .88     <.001
  Truncated Log RSS      .18     <.001       .84     <.001
  PACA                   .17     <.001       .91     <.001

Residual Scores
  Quadratic RSS          .13      .011       .87     <.001
  Spherical RSS          .13      .011       .88     <.001
  Truncated Log RSS      .14      .006       .84     <.001
  PACA                   .12      .017       .91     <.001

*Probability of rejecting null hypothesis of no
significant difference from zero.

Validity  and  internal  consistency  reliability  coefficients  for  the  residual 
scores  are  also  shown  in  Table  2.  The  reliability  coefficients  for  the  residual 
scores  are  exactly  the  same  as  the  reliability  coefficients  for  the  probabilis¬ 
tic  test  scores.  The  validity  coefficients  for  the  residual  scores  were  all 
significantly different from zero but not from each other (.01 significance
level), and these validity coefficients were significantly lower (p < .05) for the
residual  scores  than  for  the  unpartialled  probabilistic  test  scores  (.18  vs.  .13 
for  QUAD,  .18  vs.  .13  for  SPHER,  .18  vs.  .14  for  TLOG,  and  .17  vs.  .12  for 
PACA).  This  decrease  in  the  magnitude  of  the  validity  coefficients  of  the  re¬ 
sidual  scores  is  not  due  to  a  restriction  in  range  problem,  since  the  range  of 
scores  for  the  probabilistic  test  scores  was  very  similar  to  that  of  the  residu¬ 
al  scores,  as  is  shown  in  Table  3. 

Table  3 

Range  of  Scores  for  Probabilistic  and 
Residual  Test  Scores 


Scoring Method       Probabilistic       Residual

Quadratic
Spherical
Truncated Log
PACA

Factor Analysis of Probabilistic Test Scores


Factor analyses of the unpartialled probabilistic and residual test scores
yielded  virtually  identical  results;  therefore,  only  the  results  of  the  factor 
analyses  of  the  probabilistic  test  scores  are  reported  here. 

Figures 1a to 1d show the results of the parallel analyses performed for
each of the scoring methods (numerical data are in Appendix Table C). The
eigenvalues obtained from the principal axes factor analysis of the random data
were  all  low;  as  expected,  no  factor  accounted  for  significantly  more  variation 
in  the  items  than  any  other  factor.  In  comparing  the  eigenvalues  of  the  actual 
data  with  those  from  the  random  data,  it  is  clear  that  one  strong  factor  is  pre¬ 
sent  for  all  of  the  scoring  methods.  A  second  factor  also  appears  for  each  of 
the scoring methods with an eigenvalue greater than that of the second factor for
the random data, but the eigenvalues for the second factors of the random and
actual data are so close that the second factor (and third factor for TLOG) for
the actual data can be considered to be the same strength as a random factor.
On the basis of these results, one-factor principal axis factor solutions were
obtained for each of the scoring methods and are shown in Table 4.

The  factor  loadings  in  Table  4  are  positive  and  fairly  high  for  all  items 
and  all  scoring  formulas,  indicating  a  global  factor  for  each  of  the  scoring 
methods.  The  magnitudes  of  the  eigenvalues  show  that  this  factor  accounted  for 
more  of  the  variance  of  the  item  responses  for  the  PACA  scoring  formula  (26%) 
than  for  any  of  the  other  scoring  formulas  (19.9%,  20.9%,  and  17.4%  for  the 
QUAD, SPHER, and TLOG scoring formulas).

The  correlations  between  factor  loadings  across  the  30  items  for  the  vari¬ 
ous  scoring  methods  are  presented  in  the  lower  left  triangle  of  Table  5,  while 
coefficients of congruence are reported in the upper right triangle of Table 5.
The  coefficients  of  congruence  are  at  the  maximum  of  1.00  for  all  of  the  pairs 
of  factor  loadings  and  the  correlations  among  all  of  the  factor  loadings  are 
very  high,  except  for  the  correlation  between  the  factor  loadings  for  the  PACA 
and  TLOG  scoring  methods,  which  was  only  .80.  The  fact  that  all  of  the  coeffi¬ 
cients  of  congruence  are  equal  to  the  maximum  value  for  this  index  is  due  to  the 
dependence  of  this  index  upon  the  magnitude  and  sign  of  the  factor  loadings. 
Gorsuch  (1974,  p.  254)  notes  that  this  index  will  be  high  for  factors  whose 
loadings  are  approximately  the  same  size  even  if  the  pattern  of  loadings  for  the 
two  factors  is  not  the  same. 
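
As a rough illustration of Gorsuch's point, the sketch below compares the
coefficient of congruence with the ordinary product-moment correlation for two
sets of loadings. The loadings are hypothetical values chosen for illustration
(they are not the Table 4 loadings), and the function names are assumptions of
this sketch.

```python
import math

def congruence(x, y):
    """Coefficient of congruence: sum(xy) / sqrt(sum(x^2) * sum(y^2))."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def pearson(x, y):
    """Product-moment correlation: congruence of the mean-centered values."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return congruence([a - mx for a in x], [b - my for b in y])

# Hypothetical loadings: all positive and similar in size,
# but the second set has the reverse pattern of the first.
f1 = [0.40, 0.45, 0.50, 0.55, 0.60]
f2 = [0.60, 0.55, 0.50, 0.45, 0.40]

print("congruence :", round(congruence(f1, f2), 3))
print("correlation:", round(pearson(f1, f2), 3))
```

Because the congruence index is not mean-centered, positive loadings of similar
size give a value near 1.00 even though the two patterns here are perfectly
negatively correlated.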

Table 4

Factor Loadings on the First Factor
for Multiple-Choice Items with a
Probabilistic Response Format

(Rows: item number. Columns: scoring method, including SPHER and PACA.)


Table 5

Correlations (Lower Triangle) and Coefficients
of Congruence (Upper Triangle) Between
Factor Loadings Obtained for Four Scoring Methods

Scoring
Method      QUAD    SPHER    TLOG    PACA

QUAD          -     1.00     1.00    1.00
SPHER        .97      -      1.00    1.00
TLOG         .95     .92       -     1.00
PACA         .90     .93      .80      -


Discussion and Conclusions


The Influence of Certainty

The evidence concerning the effect of examinee certainty on probabilistic
test scores suggests that certainty as a response style variable has a small,
almost negligible effect on the probabilistic test scores obtained in this
study. The reliability coefficients for the five scoring methods were exactly
the same for the probabilistic and residual test scores, indicating that the
certainty variable was neither contributing reliable variance to the
probabilistic test scores nor artificially increasing the reliability
coefficients. The factor structures of the probabilistic test scores and the
residual test scores were also identical. The factor structure and internal
consistency reliability data (which are both based upon the interitem
correlations for each scoring method) indicate no effect of the certainty
variable on probabilistic test scores above and beyond the effect on the
residual test scores (i.e., the probabilistic test scores with the "pure"
certainty index partialled out). This lack of effect is demonstrated by the
extremely high correlations between the scores derived assuming conventional
multiple-choice instructions (the dichotomous score) and the probabilistic test
scores for all of the scoring methods studied, and by the extremely low
correlations between the "pure" certainty index (the residual certainty index)
and the probabilistic test scores for each scoring method. Since the
dichotomous test scores simulate testing conditions under conventional
multiple-choice instructions to choose the one correct answer, these high
correlations suggest that the greatest portion of the variability in the
probabilistic test scores for all of the scoring formulas is not different
from that present in scores obtained with traditional multiple-choice tests.

The validity coefficients did show an effect of the certainty index on the
probabilistic test scores. The significant decrease in the validity
coefficients which occurs when the "pure" certainty index is partialled from the
probabilistic test scores is evidence of some effect of the certainty variable on
the probabilistic test scores. However, even though the decrease was
significant for all of the scoring formulas, the practical difference was small.
The validity coefficients of the probabilistic test scores were all low initially,
since the reported GPA criterion is a complex variable not easily predicted by a
single factor of analogical reasoning. Reported GPA might not have been a true
reflection of actual GPA (the data of Thompson and Weiss, 1980, show a
correlation of .59 between the two), but this invalidity should not have
affected the comparisons made in this study. Additional research utilizing
different criterion measures is recommended to further investigate the generality
of the results obtained here.

Other than the small effect of the certainty variable on the validity
coefficients for each of the scoring formulas, there appears to be no effect of
the certainty variable on the probabilistic test scores. However, since not all of
the variance in the probabilistic test scores can be accounted for by the "pure"
knowledge and certainty indices, there may be some other response style variable
that exerts an influence upon the probabilistic test scores. This influence
would have to be extremely small, though, since the knowledge and certainty
indices accounted for 88%, 84%, 78%, and 92% of the variance in the scores
obtained from the spherical, quadratic, truncated log, and PACA scoring formulas,
respectively.


Choice  among  Scoring  Methods 

The choice among the five scoring methods must be made on the basis of the
validity coefficients, the reliability coefficients, and the factor analysis
results. Since there were no significant differences between any of the validity
coefficients, these coefficients do not provide support for any one scoring
method. In terms of the reliability coefficients, the PACA (and its equivalent
AIKEN) scoring formula yielded scores having the highest reliability
coefficients of all of the scoring methods.


The dependence of both the internal consistency reliability coefficient and
the one-factor solution on the interitem correlations suggests that scores from
the scoring formulas with the highest reliability coefficients would also have
the strongest first factors, and this is exactly what occurred in this study.
Hypothesizing that the factor extracted represents verbal ability, it is
desirable that this factor account for as large a proportion of each item's
variance as possible. The factor contribution of this first factor was greater
for the two scoring methods that are not reproducing scoring systems (PACA and
AIKEN) than for the three scoring methods that are reproducing scoring systems.
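
The link between interitem correlations, internal consistency, and the strength
of the first factor can be sketched for an idealized case. The example below
assumes k items that all intercorrelate at the same value r, uses the
standardized (Spearman-Brown) form of coefficient alpha, and uses the largest
eigenvalue of such an equal-correlation matrix, which is 1 + (k - 1)r. These
simplifications and the function names are assumptions of this sketch, not the
computations used in the study.

```python
def standardized_alpha(k, r_bar):
    """Standardized alpha for k items with mean interitem correlation r_bar."""
    return k * r_bar / (1 + (k - 1) * r_bar)

def first_eigenvalue_equal_corr(k, r):
    """Largest eigenvalue of a k x k correlation matrix whose
    off-diagonal entries all equal r (with r >= 0)."""
    return 1 + (k - 1) * r

k = 30  # number of items, as in this study
for r in (0.05, 0.10, 0.20):  # hypothetical interitem correlations
    alpha = standardized_alpha(k, r)
    share = first_eigenvalue_equal_corr(k, r) / k
    print(f"r = {r:.2f}  alpha = {alpha:.3f}  first-factor share = {share:.3f}")
```

In this idealized case both quantities increase together as r increases, which
is consistent with the ordering of the scoring methods reported above.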

On the basis of these results, either the PACA or the AIKEN scoring method can
be recommended for use with multiple-choice items with a probabilistic response
format. Since PACA is the simpler of the two methods, it might be the
preferable scoring method.

Conclusions 

Test scores obtained from the five methods of scoring multiple-choice items
with a probabilistic response format do not appear to be affected by the
response style or personality variable of examinee certainty to a greater degree
than scores obtained under traditional multiple-choice instructions. The
scoring method used does not affect the validity of the test scores but does
appear to affect the internal consistency of the scores. Test scores obtained
using the PACA scoring method were more reliable, simpler to compute, and as
valid as those obtained from the other scoring methods; therefore, use of the
PACA scoring method is recommended for these types of items.

As a note of caution, however, one of the three reproducing scoring systems
might have a practical advantage over either the PACA or AIKEN scoring formulas.
In a situation where examinees were aware of the scoring formula to be used and
where the scores were of some importance to the examinee (as for a classroom
grade or selection procedure), the examinees could optimize their test score
using the reproducing scoring systems only by responding according to their
actual beliefs in the correctness of each alternative, while their total scores
could be maximized with the PACA scoring formula by assigning the maximum
probability of 1.00 to the one alternative they thought was the correct one. If
examinees were expected to utilize this strategy, one of the reproducing scoring
systems would be better to use with multiple-choice items with a probabilistic
response format. Test scores obtained from the spherical reproducing scoring
system were more reliable, as valid, and showed a stronger first factor than
scores from the other reproducing scoring systems. Thus, if the practical
situation requires use of a reproducing scoring system, the spherical RSS should
be used.
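
The difference between a reproducing scoring system and the PACA formula can be
illustrated with a small expected-score calculation. The sketch below uses
common textbook forms of the quadratic and spherical rules and treats PACA as
the probability assigned to the correct alternative; the exact scaling of the
formulas used in this study may differ, and the belief vector is hypothetical.
The truncated log rule is left out because its truncation point is not restated
in this section.

```python
import math

def paca(report, k):
    """Probability assigned to the correct alternative k."""
    return report[k]

def quadratic(report, k):
    """Textbook quadratic rule: 2 * r_k - sum of squared probabilities."""
    return 2 * report[k] - sum(x * x for x in report)

def spherical(report, k):
    """Textbook spherical rule: r_k / sqrt(sum of squared probabilities)."""
    return report[k] / math.sqrt(sum(x * x for x in report))

def expected_score(rule, beliefs, report):
    """Expected item score when alternative k is correct with probability
    beliefs[k] and the examinee submits the vector `report`."""
    return sum(p * rule(report, k) for k, p in enumerate(beliefs))

beliefs = [0.6, 0.2, 0.1, 0.1]   # hypothetical actual beliefs
honest = beliefs                 # report the actual beliefs
all_in = [1.0, 0.0, 0.0, 0.0]    # put everything on the favored alternative

for rule in (paca, quadratic, spherical):
    print(rule.__name__,
          round(expected_score(rule, beliefs, honest), 3),
          round(expected_score(rule, beliefs, all_in), 3))
```

Under these assumptions the all-in report has the higher expected PACA score
(.60 versus .42), while the honest report has the higher expected score under
the quadratic (.42 versus .20) and spherical (about .65 versus .60) rules.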
Appendix:

Supplementary Tables


Table A

IRT Item Parameters for
Multiple-Choice Analogy Items

Item
Number       a        b        c

310        .616    -.483     .20
273        .627    2.062     .20
275        .652    1.617     .21
286        .673    2.407     .09
327        .693    1.129     .22
399        .722     .446     .24
419        .750    2.413     .16
278        .770    2.002     .17
266        .815    1.690     .38
271        .828    1.266     .09
268        .844    1.036     .17
392        .865    -.360     .20
492        .914    -.145     .12
331        .930    1.352     .20
578        .946     .271     .20
405        .983     .739     .16
323       1.005     .828     .20
394       1.006    -.153     .20
277       1.041    1.930     .17
335       1.075    1.525     .20
575       1.098     .197     .25
560       1.132    -.007     .27
452       1.156    -.341     .30
493       1.172     .076     .26
576       1.211     .633     .20
415       1.234    1.183     .24
322       1.232     .960     .17
250       1.288     .513     .17
284       1.357    2.232     .24
339       1.608    1.818     .17

Mean       .975     .961     .20
SD         .244     .887     .06
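
To show how the a, b, and c values in Table A might be read, the sketch below
evaluates a three-parameter logistic item response function. The logistic form
and the scaling constant D = 1.7 are assumptions of this sketch; this section
does not state which form of the three-parameter model was fitted.

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under a three-parameter logistic
    model: discrimination a, difficulty b, lower asymptote c. The constant
    D = 1.7 is an assumed scaling factor."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# Item 310 from Table A: a = .616, b = -.483, c = .20
for theta in (-2.0, -0.483, 0.0, 2.0):
    print(theta, round(p_correct_3pl(theta, 0.616, -0.483, 0.20), 3))
```

At theta = b the probability is halfway between c and 1, which is .60 for
item 310; at very low ability levels it approaches c.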


Table  B 

Instructions  Given  Prior  to  Administration  of  Multiple-Choice 
Items  with  a  Probabilistic  Response  Format 


Screen  29891* 

That  completes  the  introductory  information. 

Type "GO" and press "RETURN" for the instructions for
the  first  test. 

Screen  29842* 

This  is  a  test  of  word  knowledge.  It  is  probably  different 
from  other  tests  you  have  taken,  so  it  is  important  to  read 
the  instructions  carefully  to  understand  how  to  answer  the 
questions.

Each  question  consists  of  a  pair  of  words  that  have  a  specific 
relationship  to  each  other,  followed  by  four  possible  answers 
consisting  of  pairs  of  words.  One  of  these  four  pairs  of 
words  has  the  same  relationship  as  the  first  pair  of  words. 

Type  "GO"  and  press  "RETURN"  for  an  example. 

Screen  29824* 

For  example: 

Hot:Cold

1) Hard:Soft
2) House:Building
3) Mule:Horse
4) Yellow:Brown

Your  job  in  this  test  is  not  to  choose  the  correct  answer 
(the  pair  of  words  that  has  the  same  relationship  as  the  first 
pair  of  words)  but  to  indicate  your  confidence  that  each  of 
the  four  answers  is  the  correct  answer. 

Type "GO" and press "RETURN" to continue the instructions.

Screen 29804*

You  indicate  your  confidence  by  distributing  100  points 
among  the  four  answers.  The  answer  you  think  is  the 
correct  one  should  get  the  highest  number  of  points,  and 
the  answer  you  feel  is  least  likely  to  be  the  correct  answer 
should  get  the  lowest  number  of  points. 

The more certain you are that an answer is the correct one,
the closer your response to that answer should be to 100.
The more certain you are that an answer is NOT the correct
one, the closer your response for that answer should be to 0.



If  you  are  completely  certain  that  one  of  the  answers  is  the 
correct  answer,  assign  100  to  that  answer  and  0  to  the  other 
answers  for  that  question.  If  you  are  completely  uncertain  as 
to  which  answer  is  correct,  assign  25  to  each  of  the  four 
answers.

Type  "GO"  and  press  "RETURN"  to  continue. 

Screen  29805* 

The  numbers  you  distribute  among  the  four  answers  must  sum  to 
99  or  100.  However,  you  can  distribute  the  100  points  in  any 
way  you  like,  as  long  as  they  reflect  your  certainty  as  to  the 
"correctness"  of  each  answer. 

To  answer  a  question,  type  the  numbers  you  assign  to  each 
answer  in  a  line  in  the  order  in  which  the  answers  appear  in 
the  question.  Separate  each  number  by  a  comma. 

Type  "GO"  and  press  "RETURN"  for  an  example. 

Screen  29825* 

Going  back  to  the  sample  question: 

Hot:Cold

1) Hard:Soft
2) House:Building
3) Mule:Horse
4) Yellow:Brown

Suppose  a  person  responded  with  the  following  numbers: 

? 80,0,0,20

This person was:

a) fairly sure, but not completely certain, that
the first answer (Hard:Soft) had the same
relationship as the pair of words in the
question and thus was the correct answer.

b)  completely  certain  that  answers  "2"  and  "3" 
were  NOT  the  correct  choice. 

c)  unsure  about  whether  or  not  the  fourth  answer 
was  the  correct  answer,  but  felt  that  it  was 
closer  to  being  an  incorrect  answer  than  the 
correct  answer. 

Note that 80 + 0 + 0 + 20 = 100.

Type  "GO"  and  press  "RETURN"  to  continue  the  instructions. 



Screen  29826* 

Let's  look  at  this  question  once  more: 

Hot:Cold

1) Hard:Soft
2) House:Building
3) Mule:Horse
4) Yellow:Brown

Suppose  a  person  responded  with  the  following  numbers: 

? 33,0,33,33

This person was:

a)  completely  certain  that  the  second  answer  was  NOT  the 
correct  answer. 

b) unsure as to which of the remaining answers was correct
and  felt  that  any  of  the  remaining  three  answers  were 
equally  likely  to  be  the  correct  answer. 

Type  "GO"  and  press  "RETURN"  to  continue  the  instructions. 

Screen  29827* 

As  you  can  see,  there  is  an  almost  endless  variety  of 
combinations  of  numbers  that  you  may  use  to  state  your 
confidence  in  the  four  possible  answers.  Use  the  entire 
range  of  numbers  between  0  and  100  to  express  your 
confidence.  Remember  also  that  the  numbers  you  assign  to 
the  four  answers  must  sum  to  99  or  100. 

Please  ask  the  proctor  for  help  if  you  have  any  questions. 


Type  "GO"  and  press  "RETURN"  when  you  are  ready  to  start 
the  test. 
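
The response format described in Table B (four comma-separated numbers that sum
to 99 or 100) could be checked and converted to probabilities along the
following lines. This is a hypothetical sketch: the function name, the error
handling, and the choice to divide by the actual total (so that a response
summing to 99 still yields probabilities summing to 1) are assumptions, not a
description of the software used in the study.

```python
def parse_response(line, n_alternatives=4):
    """Parse a response such as '80,0,0,20' into a list of probabilities.

    Raises ValueError if the response does not contain one whole number per
    alternative, if a number is outside 0-100, or if the numbers do not sum
    to 99 or 100.
    """
    parts = line.strip().lstrip("?").split(",")
    if len(parts) != n_alternatives:
        raise ValueError("expected one number per alternative")
    points = [int(p.strip()) for p in parts]  # int() raises ValueError on bad input
    if any(p < 0 or p > 100 for p in points):
        raise ValueError("each number must be between 0 and 100")
    total = sum(points)
    if total not in (99, 100):
        raise ValueError("numbers must sum to 99 or 100")
    # Assumed normalization: divide by the actual total.
    return [p / total for p in points]

print(parse_response("? 80,0,0,20"))
print(parse_response("33,0,33,33"))
```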